Amino acid variability, tradeoffs and optimality in human diet

Studies at the molecular level demonstrate that dietary amino acid intake produces substantial effects on health and disease by modulating metabolism. However, how these effects may manifest in human food consumption and dietary patterns is unknown. Here, we develop a series of algorithms to map, characterize and model the landscape of amino acid content in human food, dietary patterns, and individual consumption including relations to health status, covering over 2,000 foods, ten dietary patterns, and over 30,000 dietary profiles. We find that the type of amino acids contained in foods and human consumption is highly dynamic with variability far exceeding that of fat and carbohydrate. Some amino acids positively associate with conditions such as obesity while others contained in the same food negatively link to disease. Using linear programming and machine learning, we show that these health trade-offs can be accounted for to satisfy biochemical constraints in food and human eating patterns to construct a Pareto front in dietary practice, a means of achieving optimality in the face of trade-offs that are commonly considered in economic and evolutionary theories. Thus this study may enable the design of human protein quality intake guidelines based on a quantitative framework.

which has the identifier '01001, Butter, salted' in the USDA food database, in the diet $x$. If $x_1 = 10$, it means that an individual eating diet $x$ consumes 10 grams of salted butter per day. Since one cannot consume a negative amount of food, all elements of $x$ are nonnegative, i.e. $x_i \ge 0$, $i = 1, 2, \cdots, 2335$.

Nutritional values of foods and diets
The nutritional profile of a food is defined by the amount of each nutrient in one gram of the food. Specifically, we quantify the abundance of amino acids in foods and diets using similar notation. Let $A = [a_1\ a_2\ \cdots\ a_{18}]$ denote the abundances of amino acids in the 2335 foods (i.e. the columns of $A$ are nutrient abundance vectors for amino acids); we can then compute the amino acid composition of each diet $x$ by $y = A^{T}x$. Only 18 amino acid variables are considered instead of 20 because the current protocol for quantification of amino acids in foods requires hydrolysis of protein, which breaks protein-bound amino acids into free amino acids, before quantification. During protein hydrolysis, the amides glutamine and asparagine are converted to glutamic acid and aspartic acid, respectively.
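The matrix-vector product described above can be illustrated with a minimal sketch. The numbers below are toy values chosen for illustration, not entries from the USDA database, and the names `A`, `x` and `intake` are assumptions made for this example.

```python
import numpy as np

# Hypothetical toy data: 3 foods (rows) and 2 amino acids (columns),
# in grams of amino acid per gram of food.
A = np.array([
    [0.010, 0.002],  # food 1
    [0.030, 0.004],  # food 2
    [0.000, 0.001],  # food 3
])

# Diet vector: grams of each food consumed per day (must be nonnegative).
x = np.array([100.0, 50.0, 200.0])
assert np.all(x >= 0), "food amounts cannot be negative"

# Absolute amino acid intake in grams per day, one entry per amino acid.
intake = A.T @ x
```

With the full dataset, `A` would have 2335 rows and 18 columns, and `intake` would hold the 18 amino acid levels of the diet.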

Mathematical definition of human dietary patterns
Human dietary patterns are representative modes of diets consumed in certain geographical regions, consumed by certain ethnic groups, or recommended by certain dietary guidelines for the goal of improving health. Examples of human dietary patterns include Mediterranean diet, Japanese diet, Paleo diet and plant-based diet, which are defined by consumption of certain combinations of foods, or ketogenic diet and Atkins diet, which are defined by limited intake of certain nutrients. Thus, each human dietary pattern can be defined mathematically as a set of constraints on the composition of foods or nutrients in a diet that falls into this category of dietary pattern.

Constraints on foods
This type of constraint limits the total amount of certain food allowed in a diet. These constraints set either an upper bound or a lower bound of daily consumption of the corresponding food or group of foods.
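For example, in a plant-based diet, consumption of animal products, such as meat, seafood, egg and dairy products, is strictly prohibited. Thus, for each food, we can define a binary label indicating whether this food is animal-based or not: $b_i = 1$ if the $i$-th food is animal-based and $b_i = 0$ otherwise. A plant-based diet $x$ then satisfies $b^{T}x = 0$. A general form of this type of constraint can be written as below: $l_F \le F x \le u_F$, in which each row of the matrix $F$ is the binary label vector of a food or group of foods, and $l_F$ and $u_F$ are the lower and upper bounds of daily consumption.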
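As a minimal sketch of the binary labelling described above, the snippet below builds an animal-based indicator from food-group names and tests whether a diet is plant-based. The food groups, the four toy foods and the function name `is_plant_based` are hypothetical and serve only as an illustration.

```python
import numpy as np

# Hypothetical food-group labels for four toy foods (not taken from the USDA database).
food_groups = ["dairy", "vegetable", "grain", "meat"]
ANIMAL_GROUPS = {"meat", "seafood", "egg", "dairy"}

# Binary label: 1 if the food is animal-based, 0 otherwise.
b = np.array([1.0 if g in ANIMAL_GROUPS else 0.0 for g in food_groups])

def is_plant_based(x, b, tol=1e-9):
    """True if the diet x (grams of each food per day) contains no animal-based food."""
    return bool(b @ np.asarray(x, dtype=float) <= tol)

vegan_diet = np.array([0.0, 300.0, 150.0, 0.0])
mixed_diet = np.array([10.0, 300.0, 150.0, 80.0])
```

Because food amounts are nonnegative, requiring the total weight of animal-based foods to be zero is equivalent to requiring each of them to be zero.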

Constraints on absolute levels of nutrient intake
This type of constraint limits the total daily intake of a nutrient in a diet. Similar to the constraints on the amount of foods, these constraints also determine upper or lower bounds of nutrient intake. For instance, in the Atkins diet, which is a carbohydrate-restricted diet, the recommended total daily intake of carbohydrate is no more than 20 grams. As we have discussed in section 1.2, the total daily intake of carbohydrate in a diet $x$ is $c_{carb}^{T}x$, in which $c_{carb}$ is the vector of carbohydrate contents of foods. Thus, the Atkins diet can be defined by the linear constraint: $c_{carb}^{T}x \le 20$. This type of constraint can be written in the general form below:

$l \le N x \le u$
in which $l$ and $u$ are the lower and upper bounds of the total daily intake of nutrients related to this diet, $x$ is the diet vector, and the $i$-th row in the matrix $N$ is the nutrient abundance vector of the $i$-th of these nutrients.
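The bound constraints on nutrient intake can be checked for a given diet with a few lines of code. The sketch below is an illustration under assumed names (`within_nutrient_bounds`, `N`, `lower`, `upper`) and toy carbohydrate contents; only the 20 gram limit comes from the text.

```python
import numpy as np

def within_nutrient_bounds(N, x, lower, upper, tol=1e-9):
    """Check lower <= N @ x <= upper element-wise for the diet x."""
    intake = np.asarray(N, dtype=float) @ np.asarray(x, dtype=float)
    above = np.all(intake >= np.asarray(lower, dtype=float) - tol)
    below = np.all(intake <= np.asarray(upper, dtype=float) + tol)
    return bool(above and below)

# One nutrient row (carbohydrate, g per g food) for three hypothetical foods.
N = np.array([[0.00, 0.50, 0.10]])
lower = np.array([0.0])
upper = np.array([20.0])  # no more than 20 g carbohydrate per day

ok_diet = np.array([200.0, 20.0, 50.0])   # 0 + 10 + 5 = 15 g carbohydrate
bad_diet = np.array([200.0, 60.0, 50.0])  # 0 + 30 + 5 = 35 g carbohydrate
```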

Constraints on ratios of nutrient intake
This type of constraint determines allowed ranges of ratios between two nutrients. Most of these constraints are about the percentage contribution of a macronutrient, such as carbohydrate, to the total daily intake of calories. For instance, in a ketogenic diet, at least 70% of the total daily calories come from fat, while a very small fraction (e.g. less than 5%) comes from carbohydrate.
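Let $c_{carb}$, $c_{fat}$ and $c_{cal}$ denote the nutrient abundance vectors for carbohydrate, fat and calories, with carbohydrate and fat expressed as the calories they contribute. The constraints on the relationship between intake of fat, carbohydrate and calories in a ketogenic diet $x$ can be written as below: $\frac{c_{fat}^{T}x}{c_{cal}^{T}x} \ge 0.7$, $\frac{c_{carb}^{T}x}{c_{cal}^{T}x} \le 0.05$. Multiplying both sides by the total caloric intake $c_{cal}^{T}x$, which is positive, and rearranging the terms, we have: $(c_{fat} - 0.7\,c_{cal})^{T}x \ge 0$, $(c_{carb} - 0.05\,c_{cal})^{T}x \le 0$. Thus, a general form of this type of constraint, for two nutrient abundance vectors $p$ and $q$ and lower and upper ratio bounds $r_l$ and $r_u$, is: $(p - r_l\,q)^{T}x \ge 0$, $(p - r_u\,q)^{T}x \le 0$. It is worth noting that all of these constraints are linear.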
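To illustrate how a ratio constraint becomes linear, the sketch below builds the two rows of a ketogenic constraint from per-gram carbohydrate, fat and calorie contents. It assumes the general Atwater factors (9 kcal per gram of fat, 4 kcal per gram of carbohydrate) to convert grams to calories; this conversion, the function name `ketogenic_rows` and the toy foods are assumptions of this example, not details taken from the text.

```python
import numpy as np

KCAL_PER_G_FAT = 9.0   # general Atwater factor, assumed here
KCAL_PER_G_CARB = 4.0  # general Atwater factor, assumed here

def ketogenic_rows(carb, fat, kcal, min_fat_frac=0.70, max_carb_frac=0.05):
    """Return rows g such that g @ x <= 0 encodes each ratio constraint."""
    carb, fat, kcal = (np.asarray(v, dtype=float) for v in (carb, fat, kcal))
    # Fat calories >= 70% of total calories.
    fat_row = min_fat_frac * kcal - KCAL_PER_G_FAT * fat
    # Carbohydrate calories <= 5% of total calories.
    carb_row = KCAL_PER_G_CARB * carb - max_carb_frac * kcal
    return np.vstack([fat_row, carb_row])

# Two hypothetical foods: pure fat (9 kcal/g) and pure carbohydrate (4 kcal/g).
rows = ketogenic_rows(carb=[0.0, 1.0], fat=[1.0, 0.0], kcal=[9.0, 4.0])

keto_diet = np.array([100.0, 5.0])      # about 2% of calories from carbohydrate
non_keto_diet = np.array([100.0, 50.0])  # about 18% of calories from carbohydrate
```

A diet is feasible when every entry of `rows @ x` is at most zero, so the rows can be appended directly to the inequality matrix of a linear program.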

Quantification of amino acid levels in diets
The goal of this part of the analysis is to quantify the abundance of amino acids in human diets and identify quantitative amino acid signatures for each dietary pattern, such as the Mediterranean, American and ketogenic diets. For each amino acid, two metrics are used to quantify its enrichment in a diet: one is the absolute daily intake of this amino acid in a diet, and the other is the fraction of this amino acid in the total daily intake of all amino acids. As we have discussed previously, there are 18 variables describing absolute levels of amino acids in foods and diets.
These variables are the amounts of serine, tyrosine, glycine, phenylalanine, proline, valine, lysine, leucine, isoleucine, tryptophan, arginine, methionine, histidine, threonine, alanine, cystine, aspartate + asparagine, and glutamate + glutamine. The units are gram amino acid per gram weight of food (for amino acid levels in foods) and gram amino acid per day (for amino acid levels in diets).
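The two metrics described above, absolute intake and fraction of total amino acid intake, can be computed together. The following is a minimal sketch with a hypothetical function name and a three-amino-acid toy vector in place of the 18 variables.

```python
import numpy as np

def amino_acid_metrics(intake):
    """Return (absolute, fractional) amino acid levels for one diet."""
    absolute = np.asarray(intake, dtype=float)
    total = absolute.sum()
    if total <= 0:
        raise ValueError("total amino acid intake must be positive")
    return absolute, absolute / total

# Toy intake vector in grams per day.
absolute, fraction = amino_acid_metrics([2.0, 1.0, 1.0])
```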

Calculation of ranges of amino acid levels in diets
As we have discussed in section 1.

Sampling of amino acid composition in diets
Amino acid composition of a diet, or relative levels of amino acids in a diet, is defined as a vector that contains ratios of absolute levels of amino acids to the total amount of all amino acids in this diet: $r_j = y_j / \sum_{m=1}^{18} y_m$, in which $y_j$ is the absolute level of the $j$-th amino acid in the diet. Since the new sample $x_{k+1}$ still needs to be in the feasible region, it must satisfy all constraints that define the dietary pattern. Thus, the minimal and maximal allowed values for the sampling step can be determined by solving linear programming problems.

NHANES data were obtained from https://wwwn.cdc.gov/nchs/nhanes/Default.aspx, and converted to R data frames using the function 'sasxport.get()' in the R package 'Hmisc'.

Comparison of data imputation methods
Two methods for missing data imputation, random forest (RF) and predictive mean matching (PMM), were applied to the USDA SR Release 28 dataset to evaluate their performance. RF-based imputation was performed using the function 'missForest()' in the R package 'missForest' 3. PMM-based imputation was performed using the function 'mice()' in the R package 'mice' 4. For each nutrient variable, we first calculated a missing ratio defined by the number of missing values for this variable divided by the total number of foods (i.e. 8788) in the dataset. All nutrient variables with a missing ratio higher than 0.6 were discarded before the following analysis. We then adopted two alternative strategies for data transformation before the imputation. In one strategy (without transformation), the raw abundances of nutrients, including amino acids, were used as the input to the data imputation algorithms. In the other strategy, absolute abundances of amino acids were transformed to the ratio of absolute amino acid levels to one plus the protein level: $\tilde{a}_{ij} = \frac{a_{ij}}{1 + p_i}$, in which $a_{ij}$ is the absolute abundance of the $j$-th amino acid in the $i$-th food and $p_i$ is the protein level of that food.
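The two preprocessing steps described above, discarding variables with a high missing ratio and transforming amino acid abundances by one plus the protein level, can be sketched as follows. The imputation itself was done in R; this Python snippet only illustrates the preprocessing, and the function names and toy arrays are assumptions.

```python
import numpy as np

def drop_sparse_columns(X, max_missing_ratio=0.6):
    """Keep columns whose fraction of missing (NaN) entries is at most the threshold."""
    X = np.asarray(X, dtype=float)
    missing_ratio = np.isnan(X).mean(axis=0)
    keep = missing_ratio <= max_missing_ratio
    return X[:, keep], keep

def transform_amino_acids(aa, protein):
    """Divide each food's amino acid abundances by one plus its protein level."""
    aa = np.asarray(aa, dtype=float)
    protein = np.asarray(protein, dtype=float)
    return aa / (1.0 + protein)[:, None]

nan = np.nan
# Toy table: 5 foods (rows) and 3 nutrient variables (columns).
X = np.array([
    [1.0, nan, 3.0],
    [nan, nan, 1.0],
    [nan, 2.0, 2.0],
    [nan, nan, 5.0],
    [nan, 4.0, 4.0],
])
X_kept, keep = drop_sparse_columns(X)

transformed = transform_amino_acids([[0.02, 0.04]], [0.25])
```

In the toy table the first column has a missing ratio of 0.8 and is discarded, while the second column sits exactly at 0.6 and is kept, since only ratios higher than 0.6 are removed.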

Imputation of USDA SR and FNDDS datasets
Missing data imputation was first done for nutritional composition data of foods in the USDA SR datasets. Nutrients with missing values in more than 60% of the foods were removed from the datasets before imputation. Amino acid abundances were transformed by dividing by one plus the protein abundance as described in the previous section. Imputation was performed using the random forest regression method implemented in the function 'missForest()' in the R package 'missForest' with default parameters. Nutritional composition values of foods in the FNDDS datasets were computed using the imputed USDA SR datasets, the mapping information from foods in SR to foods in FNDDS, factors for moisture and fat adjustments, and retention factors for nutrients.
A typical food in the FNDDS dataset is prepared from one or more foods in the USDA SR dataset, with the proportion of each food provided together with its identifier in USDA SR.
For instance, the FNDDS food with identifier '27520130' and description 'Bacon, chicken, and tomato club sandwich, with lettuce and spread' is mapped to six foods in the USDA SR database:

in which $R$ is a diagonal matrix with the $k$-th diagonal element being the retention factor for the $k$-th nutrient in preparation of the $i$-th FNDDS food from the $j$-th SR food. A retention factor of 0.1 means that 10% of this nutrient is preserved during the food preparation process while the remaining 90% is lost. This nutritional composition vector was then further corrected for moisture and fat adjustments. Letting $Z$ denote the reconstructed nutritional values (which include the reconstructed amino acid abundances in these foods) for foods in FNDDS and $x$ denote a food consumption vector in which the $i$-th value quantifies the total grams of the $i$-th food consumed on that day, we can compute the nutrient intake profile, which includes the intake of amino acids, in this dietary record:

Reconstruction of amino acid intake profiles in NHANES dietary recalls
in which $\|x\|_1$ is the L1-norm of the food consumption vector $x$ (i.e. total weight of food consumed on that day).
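The reconstruction of a prepared food from its ingredients can be sketched as a retention-weighted sum. The exact formula is not recoverable from the text, so the weighted-sum form, the array layout and the function name `recipe_nutrients` below are assumptions; the moisture and fat adjustments are omitted.

```python
import numpy as np

def recipe_nutrients(sr_nutrients, proportions, retention):
    """Nutrient vector of one prepared food from its SR ingredients.

    sr_nutrients: (n_ingredients, n_nutrients) nutrient content per gram of each SR food
    proportions:  (n_ingredients,) weight fraction of each ingredient in the recipe
    retention:    (n_ingredients, n_nutrients) retention factors between 0 and 1
    """
    sr = np.asarray(sr_nutrients, dtype=float)
    w = np.asarray(proportions, dtype=float)
    r = np.asarray(retention, dtype=float)
    # Element-wise scaling by r is equivalent to multiplying by the diagonal matrix R.
    return (w[:, None] * r * sr).sum(axis=0)

# Toy recipe: two ingredients in equal proportion, two nutrients.
sr = [[0.20, 0.010], [0.10, 0.000]]
w = [0.5, 0.5]
r = [[1.0, 0.1], [1.0, 1.0]]  # 10% of the second nutrient is retained in ingredient 1
profile = recipe_nutrients(sr, w, r)

# L1-norm of a food consumption vector: total grams of food eaten that day.
x = np.array([150.0, 50.0])
total_weight = np.abs(x).sum()
```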

Comparison of model-predicted and actual amino acid signatures of ketogenic diet
To quantify the amino acid signature of the ketogenic diet according to the NHANES dietary data, for each dietary intake profile of an individual in the NHANES dataset, a ketogenic score quantifying this person's adherence to the ketogenic diet was defined from $f_c$, the fraction of calories from dietary intake of carbohydrate, and $f_l$, the fraction of calories from dietary intake of fat. For the $i$-th amino acid, we then computed the Spearman's rank correlation coefficient between its intake and the ketogenic score: $\rho_i = \rho(s, a_i)$, in which $s$ is the vector storing the ketogenic scores of all individuals in the NHANES data, $a_i$ is the vector storing the intake of the $i$-th amino acid of all individuals in the NHANES data, and $\rho(\cdot, \cdot)$ is the Spearman's rank correlation coefficient. The computed correlation coefficients indicate associations between dietary intake of amino acids and adherence to the ketogenic diet: amino acids enriched in ketogenic diets will have positive correlation coefficients, while amino acids with lower intake in ketogenic diets will have negative correlation coefficients. Hence, they serve as indicators of amino acid signatures associated with the ketogenic diet. This amino acid signature of the ketogenic diet in NHANES data was then compared with the amino acid profile of the ketogenic diet predicted by our modeling framework using linear programming (defined as the difference between mean amino acid abundance in computationally sampled ketogenic versus other diets).

Hypertension was defined as the condition of systolic blood pressure (the variables 'bpxsy1', 'bpxsy2' and 'bpxsy3', corresponding to three consecutive measurements) being higher than 120 mm Hg and diastolic blood pressure (the variables 'bpxdi1', 'bpxdi2' and 'bpxdi3', corresponding to three consecutive measurements) being higher than 80 mm Hg.
Diabetes was defined as the condition of glycohemoglobin levels (the variable 'lbxgh') being higher than 6.5%, fasting plasma glucose concentration (the variable 'lbxglu') higher than 126 mg/dL, and blood glucose concentration in response to oral glucose tolerance test (the variable 'lbxglt') higher than 200 mg/dL. Information about the presence of cancer was obtained from answers to the question 'Have you ever been told by a doctor or other health professional that you had cancer or a malignancy of any kind?' in the questionnaire about medical conditions, in which the answers 'yes' or 'no' were linked to the presence or absence of cancer, while the answers 'refused' and 'don't know' were considered as missing data.
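The definitions of hypertension and cancer status above can be sketched as simple rules. The text does not say how the three consecutive blood pressure readings are combined, so averaging the available readings is an assumption of this example, as are the function names and the use of text answers rather than the numeric codes used in NHANES files.

```python
from statistics import mean

def has_hypertension(systolic, diastolic):
    """Apply the 120/80 mm Hg thresholds to the mean of the available readings.

    Averaging is an assumption; missing readings are passed as None.
    Returns None when either measurement has no valid reading.
    """
    sys_vals = [v for v in systolic if v is not None]
    dia_vals = [v for v in diastolic if v is not None]
    if not sys_vals or not dia_vals:
        return None
    return mean(sys_vals) > 120 and mean(dia_vals) > 80

def cancer_status(answer):
    """Map 'yes'/'no' to True/False; any other answer is treated as missing (None)."""
    return {"yes": True, "no": False}.get(answer.strip().lower())
```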

Correlation analysis
Partial Spearman correlation coefficients between dietary amino acid composition or other nutritional variables and health variables defined in the previous section were computed using the MATLAB function 'partialcorr', controlling for demographic and lifestyle-related factors including income, education, age, gender, ethnicity, marital status, smoking, alcohol consumption, physical activity, and batch. Only adults (i.e. age > 20 years old) were included in the analysis. Individuals with dietary intake of any nutrient higher than three times the 99th percentile of the intake of that nutrient among the population were considered outliers and not included in the following analysis. Table 4).
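A partial Spearman correlation can be computed by rank-transforming all variables, regressing the covariate ranks out of both variables of interest, and correlating the residuals. The sketch below follows that common recipe in Python as an illustration; it is not the MATLAB implementation, the function names are hypothetical, and categorical covariates such as ethnicity would need to be dummy-coded first.

```python
import numpy as np

def average_ranks(v):
    """1-based ranks in which tied values share their mean rank."""
    v = np.asarray(v, dtype=float)
    _, inverse, counts = np.unique(v, return_inverse=True, return_counts=True)
    last = np.cumsum(counts)              # highest rank in each tie group
    mean_rank = last - (counts - 1) / 2.0
    return mean_rank[inverse]

def partial_spearman(x, y, covariates):
    """Spearman correlation of x and y after removing covariate ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    C = np.asarray(covariates, dtype=float).reshape(len(rx), -1)
    ranked = [average_ranks(C[:, j]) for j in range(C.shape[1])]
    Z = np.column_stack([np.ones(len(rx))] + ranked)
    # Residuals of the rank vectors after least-squares regression on Z.
    res_x = rx - Z @ np.linalg.lstsq(Z, rx, rcond=None)[0]
    res_y = ry - Z @ np.linalg.lstsq(Z, ry, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])

# Toy data: y is a monotone function of x, z is a single covariate.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x ** 3
z = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
rho_pos = partial_spearman(x, y, z)
rho_neg = partial_spearman(x, -y, z)
```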

Machine learning
The fraction of variables that affect the disease outcome in each group was computed by dividing the number of variables with non-zero regression coefficients by the total number of variables in that group.
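The fraction described above is a simple per-group count. A minimal sketch, with a hypothetical function name and toy coefficients and group labels, could look like this:

```python
def fraction_selected(coefficients, groups, tol=0.0):
    """Map each group to the share of its variables with a non-zero coefficient."""
    totals, selected = {}, {}
    for coef, group in zip(coefficients, groups):
        totals[group] = totals.get(group, 0) + 1
        if abs(coef) > tol:
            selected[group] = selected.get(group, 0) + 1
    return {g: selected.get(g, 0) / n for g, n in totals.items()}

# Toy regression coefficients and the variable group each one belongs to.
coefs = [0.0, 0.3, -0.1, 0.0, 0.0]
groups = ["amino acids", "amino acids", "amino acids", "fats", "fats"]
fractions = fraction_selected(coefs, groups)
```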

Analysis of Pareto optimality
Based on the amino acids positively or negatively associated with obesity incidence that were found in the previous section, we defined two objectives for optimization of dietary amino acid intake, that is, to maximize the total intake of amino acids negatively associated with obesity incidence (i.e. AAs-to-maximize), and to minimize the total intake of amino acids positively associated with obesity incidence (i.e. AAs-to-minimize). Let $a_+$ and $a_-$ denote the vectors consisting of the total abundance of AAs-to-maximize and AAs-to-minimize in each food; then the inner products $a_+^{T}x$ and $a_-^{T}x$ indicate the total intake of AAs-to-maximize and AAs-to-minimize in a diet $x$. Therefore, we have the mathematical form of optimizing the two amino acid intake goals for a specific dietary pattern with the general form described in section 1.3.4: $\max_x a_+^{T}x$ and $\min_x a_-^{T}x$, subject to $x \in \Omega$, in which $\Omega$ is the feasible region defined by the constraints of the dietary pattern. A feasible solution $x_0$ of this problem is defined as a Pareto solution if for any other feasible solution $x_1$, $a_-^{T}x_1 > a_-^{T}x_0$ if $a_+^{T}x_1 > a_+^{T}x_0$, and $a_+^{T}x_1 < a_+^{T}x_0$ if $a_-^{T}x_1 < a_-^{T}x_0$. In other words, for any other feasible solution $x_1$, it is impossible that $x_1$ has better performance than $x_0$ in both of the two objectives of maximizing total AAs-to-maximize and minimizing total AAs-to-minimize. If the diet $x_1$ has higher total intake of AAs-to-maximize than the diet $x_0$, then it must have higher total intake of AAs-to-minimize than $x_0$. On the other hand, if the diet $x_1$ has lower total intake of AAs-to-minimize than the diet $x_0$, then it must have lower total intake of AAs-to-maximize than $x_0$. The Pareto surface of the problem is then defined as the set consisting of all Pareto solutions within the feasible region.
To construct the Pareto surface, we applied the $\epsilon$-constraint algorithm. Briefly, for a human dietary pattern with defined mathematical form, we first determine the range of total intake of AAs-to-maximize in that dietary pattern using linear programming, $[\epsilon_{min}, \epsilon_{max}]$: $\epsilon_{min} = \min_{x \in \Omega} a_+^{T}x$, $\epsilon_{max} = \max_{x \in \Omega} a_+^{T}x$. Therefore, for any value $\epsilon \in [\epsilon_{min}, \epsilon_{max}]$, a Pareto solution can be obtained by solving the linear programming problem below: $\min_x a_-^{T}x$ subject to $x \in \Omega$ and $a_+^{T}x \ge \epsilon$. The Pareto surface of each dietary pattern was constructed by uniformly selecting 100 values of $\epsilon \in [\epsilon_{min}, \epsilon_{max}]$ and computing the corresponding Pareto solution using the method described above.
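The constraint-sweep procedure described above can be sketched with SciPy's `linprog`. The snippet assumes that the dietary pattern is given as inequality constraints `A_ub @ x <= b_ub` with nonnegative food amounts, and that the intake of AAs-to-maximize is bounded from below at each sweep value; the function name, argument names and the two-food toy problem are illustrative assumptions rather than the original implementation.

```python
import numpy as np
from scipy.optimize import linprog

def pareto_front(a_plus, a_minus, A_ub, b_ub, n_points=100):
    """Sweep a lower bound on a_plus @ x and minimize a_minus @ x at each value."""
    a_plus = np.asarray(a_plus, dtype=float)
    a_minus = np.asarray(a_minus, dtype=float)
    A_ub = np.asarray(A_ub, dtype=float)
    b_ub = np.asarray(b_ub, dtype=float)
    bounds = [(0, None)] * len(a_plus)

    # Range of total AAs-to-maximize over the feasible region.
    lo = linprog(a_plus, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    hi = linprog(-a_plus, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    eps_min, eps_max = lo.fun, -hi.fun

    points = []
    for eps in np.linspace(eps_min, eps_max, n_points):
        # Extra row encodes a_plus @ x >= eps as -a_plus @ x <= -eps.
        res = linprog(
            a_minus,
            A_ub=np.vstack([A_ub, -a_plus]),
            b_ub=np.append(b_ub, -eps),
            bounds=bounds,
            method="highs",
        )
        if res.success:
            points.append((float(a_plus @ res.x), float(res.fun)))
    return points

# Toy pattern: two foods whose total weight is fixed at 10 g.
A_ub = [[1.0, 1.0], [-1.0, -1.0]]
b_ub = [10.0, -10.0]
front = pareto_front([2.0, 1.0], [1.0, 0.2], A_ub, b_ub, n_points=5)
```

In the toy problem the first food is richer in both amino acid groups, so raising the intake of AAs-to-maximize necessarily raises the intake of AAs-to-minimize, which is the trade-off the Pareto surface describes.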

Estimation of deviation from Pareto surface
For each dietary record in the NHANES dataset, we first computed the total daily intake of AAs-to-maximize and AAs-to-minimize in that dietary record, $(y_+, y_-)$, and then computed the Euclidean distance $d(y_+, y_-)$ between $(y_+, y_-)$ and the Pareto surface constructed using the method described in the previous section. Since the deviation from the Pareto surface computed this way is largely dependent on the total intake of protein in the dietary record, we adjusted it for protein intake by fitting a 6th-order polynomial function of protein intake. Let $p$ denote the intake of protein in a dietary record (which has the intake of AAs-to-maximize and AAs-to-minimize being $(y_+, y_-)$); we fit the data to the model below: $d(y_+, y_-) = f(p) = \beta_0 + \beta_1 p + \beta_2 p^2 + \beta_3 p^3 + \beta_4 p^4 + \beta_5 p^5 + \beta_6 p^6$. The residual $\hat{d} = d(y_+, y_-) - f(p)$ was then defined as the adjusted deviation from the Pareto surface, which was used in all downstream analyses related to deviation from the Pareto surface (Supplementary Figure 12).
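The two steps described above, measuring the distance from a dietary record to the Pareto surface and removing its dependence on protein intake, can be sketched as follows. Treating the surface as a piecewise-linear curve through the computed Pareto points is an assumption of this example, as are the function names and toy data; the 6th-order polynomial follows the text.

```python
import numpy as np

def distance_to_front(point, front):
    """Euclidean distance from a 2-D point to a piecewise-linear front."""
    p = np.asarray(point, dtype=float)
    f = np.asarray(front, dtype=float)
    a, b = f[:-1], f[1:]
    ab = b - a
    denom = np.einsum("ij,ij->i", ab, ab)
    # Projection parameter onto each segment, clipped to the segment ends.
    t = np.einsum("ij,ij->i", p - a, ab) / np.where(denom > 0, denom, 1.0)
    t = np.clip(t, 0.0, 1.0)
    closest = a + t[:, None] * ab
    return float(np.min(np.linalg.norm(closest - p, axis=1)))

def adjust_for_protein(deviation, protein, degree=6):
    """Residual of the deviation after a polynomial fit on protein intake."""
    deviation = np.asarray(deviation, dtype=float)
    protein = np.asarray(protein, dtype=float)
    coeffs = np.polyfit(protein, deviation, degree)
    return deviation - np.polyval(coeffs, protein)

# Toy front along the horizontal axis and two toy dietary records.
toy_front = [(0.0, 0.0), (10.0, 0.0)]
d_inside = distance_to_front((5.0, 3.0), toy_front)
d_outside = distance_to_front((12.0, 0.0), toy_front)

# Deviation that is an exact quadratic in protein leaves near-zero residuals.
protein = np.linspace(0.1, 2.0, 30)
deviation = 1.0 + 2.0 * protein + 0.5 * protein ** 2
adjusted = adjust_for_protein(deviation, protein)
```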

Supplementary References
human dietary records in the NHANES datasets.
(c) Prevalence of diabetes in human subjects with different levels of intake of the amino acids positively or negatively associated with diabetes. The p-values were calculated by two-sided chi-squared test.
(d) Associations between the diabetes prevalence and deviation of dietary intake profiles from the Pareto surface. The p-values were calculated by two-sided chi-squared test to assess the significance levels of the associations. Sample size n = 18 for single amino acids, n = 10 for deviation from Pareto surface.